Importing Libraries¶

In [1288]:
# Import the Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

Data Loading¶

In [1291]:
# load data
df = pd.read_csv("athlete_events.csv")

Data Understanding and Cleaning¶

In [1293]:
# print first 10 rows
df.head(10)
Out[1293]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 1 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball NaN
1 2 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight NaN
2 3 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football NaN
3 4 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres NaN
5 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 1,000 metres NaN
6 5 Christine Jacoba Aaftink F 25.0 185.0 82.0 Netherlands NED 1992 Winter 1992 Winter Albertville Speed Skating Speed Skating Women's 500 metres NaN
7 5 Christine Jacoba Aaftink F 25.0 185.0 82.0 Netherlands NED 1992 Winter 1992 Winter Albertville Speed Skating Speed Skating Women's 1,000 metres NaN
8 5 Christine Jacoba Aaftink F 27.0 185.0 82.0 Netherlands NED 1994 Winter 1994 Winter Lillehammer Speed Skating Speed Skating Women's 500 metres NaN
9 5 Christine Jacoba Aaftink F 27.0 185.0 82.0 Netherlands NED 1994 Winter 1994 Winter Lillehammer Speed Skating Speed Skating Women's 1,000 metres NaN
In [1294]:
# Print the column names and data types
df.dtypes
Out[1294]:
ID          int64
Name       object
Sex        object
Age       float64
Height    float64
Weight    float64
Team       object
NOC        object
Games      object
Year        int64
Season     object
City       object
Sport      object
Event      object
Medal      object
dtype: object
In [1295]:
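# copy the data frame to a new data frame for analysis
# (note: this copy is taken before duplicates are dropped in the next cell, so df_new keeps them)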
df_new = df.copy()
In [1296]:
# Check and drop duplicates
print(f"Duplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()
Duplicates: 1385
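
Note that `df_new` was copied before `drop_duplicates()` ran, so only `df` loses the duplicate rows. A minimal sketch on a made-up three-row frame (the `toy` data is illustrative, not taken from the dataset) shows how the two frames diverge:

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are exact duplicates
toy = pd.DataFrame({"ID": [1, 2, 2], "Name": ["A", "B", "B"]})

toy_copy = toy.copy()        # copy taken before de-duplication
toy = toy.drop_duplicates()  # only the name `toy` is rebound to the reduced frame

print(len(toy), len(toy_copy))
```

If the analysis frame should be free of duplicates as well, one option is to take the copy after the drop.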
In [1298]:
# Checking for null values
print("\n Sum of null values: \n")
df.isnull().sum()
 Sum of null values: 

Out[1298]:
ID             0
Name           0
Sex            0
Age         9315
Height     58814
Weight     61527
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     229959
dtype: int64
In [1299]:
# Fill missing Age values with the median age grouped by Sex, Sport, and Year
df_new['Age'] = df_new.groupby(['Sex', 'Sport', 'Year'])['Age'].transform(lambda x: x.fillna(x.median()))
In [1303]:
# Check if there are still missing Age values
df_new.isnull().sum()
Out[1303]:
ID             0
Name           0
Sex            0
Age            3
Height     60171
Weight     62875
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     231333
dtype: int64
In [1304]:
# fill in the median age values for the remaining ages
df_new['Age'] = df_new['Age'].fillna(df['Age'].median())
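
The two-step fill above (group median first, overall median as a fallback) can be sketched on a small made-up frame. The `toy` data and the two-column grouping are assumptions for illustration only. The sketch takes the fallback median from the partly filled column, whereas the cell above takes it from `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one group with observed ages, one group with none
toy = pd.DataFrame({
    "Sex":   ["M", "M", "M", "F"],
    "Sport": ["Judo", "Judo", "Judo", "Golf"],
    "Age":   [20.0, 30.0, np.nan, np.nan],
})

# Step 1: fill each missing Age with the median of its (Sex, Sport) group
toy["Age"] = toy.groupby(["Sex", "Sport"])["Age"].transform(
    lambda s: s.fillna(s.median())
)

# Step 2: a group with no observed Age is still NaN, so fall back to the overall median
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

print(toy)
```

The fallback step is what removes the last few missing values that survive the grouped fill.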
In [1305]:
# Percentage of missing values in original data
missing_height_pct = df['Height'].isnull().sum() / len(df) * 100
missing_weight_pct = df['Weight'].isnull().sum() / len(df) * 100

print(f"Missing Height: {missing_height_pct:.2f}%")
print(f"Missing Weight: {missing_weight_pct:.2f}%")
Missing Height: 21.80%
Missing Weight: 22.81%
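
As an alternative to computing each percentage by hand, the share of missing values can be obtained for every column at once, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with known gaps
toy = pd.DataFrame({
    "Height": [180.0, np.nan, 170.0, np.nan],
    "Medal":  [np.nan, np.nan, np.nan, "Gold"],
})

# isnull() gives a boolean frame; its column means are the missing fractions
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(2))
```

Computing all columns in one expression also avoids mixing up per-column variable names.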
In [1306]:
# Fill missing Height and Weight with median values grouped by Sex, Sport, and Year in df_new
df_new['Height'] = df_new.groupby(['Sex', 'Sport', 'Year'])['Height'].transform(lambda x: x.fillna(x.median()))
df_new['Weight'] = df_new.groupby(['Sex', 'Sport', 'Year'])['Weight'].transform(lambda x: x.fillna(x.median()))

df_new['Height'] = df_new['Height'].fillna(df['Height'].median())
df_new['Weight'] = df_new['Weight'].fillna(df['Weight'].median())
In [1307]:
# Check if there are still missing Height and Weight values
df_new.isnull().sum()
Out[1307]:
ID             0
Name           0
Sex            0
Age            0
Height         0
Weight         0
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     231333
dtype: int64
In [1309]:
# if values for age, height and weight are zero or negative, replace them with the median
df_new.loc[df_new['Age'] <= 0, 'Age'] = df_new['Age'].median()
df_new.loc[df_new['Height'] <= 0, 'Height'] = df_new['Height'].median()
df_new.loc[df_new['Weight'] <= 0, 'Weight'] = df_new['Weight'].median()
print("Replaced invalid values with median.")
Replaced invalid values with median.
In [1310]:
# Percentage of missing values for Medal
missing_medal_pct = df['Medal'].isnull().sum() / len(df) * 100
print(f"Missing Medal Value in Percentage: {missing_medal_pct:.2f}%")
Missing Medal Value in Percentage: 85.25%
In [1312]:
# fill the missing values with No Medal
df_new['Medal'] = df_new['Medal'].fillna("No Medal")
In [1313]:
# make sure there are no null values
df_new.isnull().sum()
Out[1313]:
ID        0
Name      0
Sex       0
Age       0
Height    0
Weight    0
Team      0
NOC       0
Games     0
Year      0
Season    0
City      0
Sport     0
Event     0
Medal     0
dtype: int64

Additional Cleaning¶

In [1316]:
# rename the Team column to Country
df_new.rename(columns={'Team': 'Country'}, inplace=True)
In [1317]:
# Uniform capitalization with title case
df_new['Country'] = df_new['Country'].str.title()
In [1318]:
# check the columns after the additional changes - column renamed from Team to Country
df_new.head(10)
Out[1318]:
ID Name Sex Age Height Weight Country NOC Games Year Season City Sport Event Medal
0 1 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball No Medal
1 2 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight No Medal
2 3 Gunnar Nielsen Aaby M 24.0 171.5 74.0 Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football No Medal
3 4 Edgar Lindenau Aabye M 34.0 175.0 70.0 Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres No Medal
5 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 1,000 metres No Medal
6 5 Christine Jacoba Aaftink F 25.0 185.0 82.0 Netherlands NED 1992 Winter 1992 Winter Albertville Speed Skating Speed Skating Women's 500 metres No Medal
7 5 Christine Jacoba Aaftink F 25.0 185.0 82.0 Netherlands NED 1992 Winter 1992 Winter Albertville Speed Skating Speed Skating Women's 1,000 metres No Medal
8 5 Christine Jacoba Aaftink F 27.0 185.0 82.0 Netherlands NED 1994 Winter 1994 Winter Lillehammer Speed Skating Speed Skating Women's 500 metres No Medal
9 5 Christine Jacoba Aaftink F 27.0 185.0 82.0 Netherlands NED 1994 Winter 1994 Winter Lillehammer Speed Skating Speed Skating Women's 1,000 metres No Medal

Understanding the ranges for Age, Height and Weight¶

In [1320]:
# Calculate summary statistics for 'Age', 'Height', and 'Weight'
stats = df_new[['Age', 'Height', 'Weight']].agg(['mean', 'median', 'std', 'min', 'max'])

# Print the statistics
print(stats)
              Age      Height      Weight
mean    25.622475  175.127274   70.999576
median  25.000000  175.000000   70.000000
std      6.401233    9.736305   13.390853
min     10.000000  127.000000   25.000000
max     97.000000  226.000000  214.000000

Exploring: This shows that there is a wide range in the age, height and weight of the participants. However, the mean age is around 25, an age at which most athletes are close to peak fitness.
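
The range itself is not part of the table above. As a worked sketch, it can be derived as max minus min. The frame below is rebuilt by hand from the printed min and max values, and the `range` row is an addition for illustration rather than part of the original output.

```python
import pandas as pd

# Min and max copied from the statistics table printed above
stats = pd.DataFrame(
    {"Age": [10.0, 97.0], "Height": [127.0, 226.0], "Weight": [25.0, 214.0]},
    index=["min", "max"],
)

# Range = max - min, added as an extra row
stats.loc["range"] = stats.loc["max"] - stats.loc["min"]
print(stats)
```

This gives a spread of 87 years in age, 99 cm in height and 189 kg in weight.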


In [1324]:
df_cleaned = df_new.copy()
df_cleaned.to_csv('olympic_data_analysis_cleaned.csv', index=False)

Exploratory Data Analysis (EDA)¶

1. Total Number of Athletes and Participation Over Years¶

In [1329]:
# get the number of unique athletes by name, since athletes can appear multiple times for the same country
total_athletes = df_new['Name'].nunique()
print(f"Total number of unique athletes: {total_athletes}")
Total number of unique athletes: 134732
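
Counting by `Name` treats two different athletes who share a name as one person. Since the `ID` column repeats for the same athlete in the rows shown earlier, counting unique `ID` values is a possible alternative. A minimal sketch with made-up names (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: two different athletes (ID 1 and ID 2) share one name
toy = pd.DataFrame({
    "ID":   [1, 2, 2],
    "Name": ["Sam Lee", "Sam Lee", "Sam Lee"],
})

by_name = toy["Name"].nunique()  # merges the two athletes into one
by_id = toy["ID"].nunique()      # keeps them apart

print(by_name, by_id)
```

Which count is more appropriate depends on how reliable the identifiers are, and the notebook's figures are based on names.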
In [1331]:
# Get the total number of participants per year per season 
participation_by_year = df_new.groupby(['Year', 'Season'])['Name'].nunique().reset_index()
participation_by_year.columns = ['Year', 'Season', 'Number of Participants']
print(" Number of Participation over Years\n")
print(participation_by_year.to_string(index=False))
 Number of Participation over Years

 Year Season  Number of Participants
 1896 Summer                     176
 1900 Summer                    1220
 1904 Summer                     650
 1906 Summer                     841
 1908 Summer                    2024
 1912 Summer                    2409
 1920 Summer                    2675
 1924 Summer                    3256
 1924 Winter                     313
 1928 Summer                    3246
 1928 Winter                     461
 1932 Summer                    1922
 1932 Winter                     252
 1936 Summer                    4482
 1936 Winter                     668
 1948 Summer                    4402
 1948 Winter                     668
 1952 Summer                    4931
 1952 Winter                     694
 1956 Summer                    3346
 1956 Winter                     821
 1960 Summer                    5348
 1960 Winter                     665
 1964 Summer                    5134
 1964 Winter                    1094
 1968 Summer                    5552
 1968 Winter                    1160
 1972 Summer                    7105
 1972 Winter                    1008
 1976 Summer                    6070
 1976 Winter                    1127
 1980 Summer                    5252
 1980 Winter                    1071
 1984 Summer                    6791
 1984 Winter                    1272
 1988 Summer                    8443
 1988 Winter                    1425
 1992 Summer                    9380
 1992 Winter                    1801
 1994 Winter                    1738
 1996 Summer                   10324
 1998 Winter                    2178
 2000 Summer                   10639
 2002 Winter                    2397
 2004 Summer                   10537
 2006 Winter                    2494
 2008 Summer                   10880
 2010 Winter                    2535
 2012 Summer                   10502
 2014 Winter                    2744
 2016 Summer                   11174

Remarks: There is an overall increase in participation over the years. The numbers alternate between high and low values because the Summer Olympics draw many more participants than the Winter Olympics.

2. Top Participating Countries¶

In [1335]:
# group by NOC and count the rows (athlete-event entries); an athlete with several entries is counted once per entry
country_participation = df_new.groupby('NOC')['ID'].count().sort_values(ascending=False).reset_index().rename(columns={'ID': 'Number of Participants'})
print("Top Participating Countries \n")
country_participation.head(15)
Top Participating Countries 

Out[1335]:
NOC Number of Participants
0 USA 18853
1 FRA 12758
2 GBR 12256
3 ITA 10715
4 GER 9830
5 CAN 9733
6 JPN 8444
7 SWE 8339
8 AUS 7638
9 HUN 6607
10 POL 6207
11 SUI 6150
12 NED 5839
13 URS 5685
14 FIN 5467
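
The table above counts rows, so an athlete who entered several events is counted several times. A small sketch on made-up data (the `toy` frame and NOC codes are hypothetical) contrasts `count()` with `nunique()`:

```python
import pandas as pd

# Hypothetical toy data: athlete 1 of "AAA" has two entries
toy = pd.DataFrame({
    "NOC": ["AAA", "AAA", "AAA", "BBB"],
    "ID":  [1, 1, 2, 3],
})

entries = toy.groupby("NOC")["ID"].count()     # rows per NOC
athletes = toy.groupby("NOC")["ID"].nunique()  # distinct athletes per NOC

print(entries)
print(athletes)
```

Using `nunique()` would be one way to rank countries by distinct athletes instead of by entries.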

3. Top 10 Sports with Most Events¶

In [1339]:
# Group by sport and count the rows (athlete-event entries); this counts entries, not distinct events
top_sports = df_new.groupby('Sport')['ID'].count().reset_index()
top_sports.columns = ['Sport', 'Event Count']
top_sports = top_sports.sort_values(by='Event Count', ascending=False).head(10)

# add ranks for clarity
top_sports['Rank'] = range(1, 11) 

# print out the columns to show Rank first 10
top_sports = top_sports[['Rank', 'Sport', 'Event Count']]
print(top_sports.to_string(index=False))
 Rank                Sport  Event Count
    1            Athletics        38624
    2           Gymnastics        26707
    3             Swimming        23195
    4             Shooting        11448
    5              Cycling        10859
    6              Fencing        10735
    7               Rowing        10595
    8 Cross Country Skiing         9133
    9        Alpine Skiing         8829
   10            Wrestling         7154
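
The "Event Count" column above is the number of athlete entries per sport. If the number of distinct events is wanted, one possible approach is to count unique `Event` values instead. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data: three entries in one Judo event, two Speed Skating events
toy = pd.DataFrame({
    "Sport": ["Judo", "Judo", "Judo", "Speed Skating", "Speed Skating"],
    "Event": [
        "Judo Men's Extra-Lightweight",
        "Judo Men's Extra-Lightweight",
        "Judo Men's Extra-Lightweight",
        "Speed Skating Women's 500 metres",
        "Speed Skating Women's 1,000 metres",
    ],
})

entries = toy.groupby("Sport")["Event"].count()   # rows per sport
events = toy.groupby("Sport")["Event"].nunique()  # distinct events per sport

print(entries)
print(events)
```

In this toy case Judo leads by entries while Speed Skating leads by distinct events, so the two rankings can differ.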

4. Athlete Representation in Summer vs. Winter¶

In [1341]:
# groupby season and name of the athletes (only unique values to exclude repetition)
season_representation = df_new.groupby('Season')['Name'].nunique().reset_index()
season_representation.columns = ['Season', 'Athlete Count']
print(season_representation.to_string(index=False))
Season  Athlete Count
Summer         116122
Winter          18923

Observation: Participation is much higher in the Summer Olympics than in the Winter Olympics, consistent with the per-year counts above.


Data Visualization and Interpretation¶

1. Medal Trends Over Time by Gender¶

In [1351]:
# Separate the dataset into Summer and Winter Olympics data
summer_data = df_new[df_new['Season'] == 'Summer']
winter_data = df_new[df_new['Season'] == 'Winter']

# Exclude "No Medal" rows and group by year, season and sex
medal_trends = df_new[df_new['Medal'] != 'No Medal'].groupby(['Year', 'Season', 'Sex'])['ID'].count().reset_index()
medal_trends.columns = ['Year', 'Season', 'Sex', 'Medal Count']

plt.figure(figsize=(12, 6))
# loop over each unique season
for season in medal_trends['Season'].unique():
    subset = medal_trends[medal_trends['Season'] == season]
    # further loop through each gender for the current season
    for gender in subset['Sex'].unique():
        gender_subset = subset[subset['Sex'] == gender]
        plt.plot(gender_subset['Year'], gender_subset['Medal Count'], label=f"{season} - {gender}")

plt.title("Medal Trends Over Time by Gender and Season")
plt.xlabel("Year")
plt.ylabel("Number of Medals")
plt.legend(title="Season & Gender")
plt.grid()
plt.show()
[Figure: Medal Trends Over Time by Gender and Season (line chart, Year vs. Number of Medals)]

Interpretation and Observation:
To understand trends for male and female athletes over the years, the medal data is grouped by Year, Season and Sex. As seen above, the gap between male and female medal counts narrows in both seasons, which points to rising female participation. After around 1980 there is a rapid rise for female athletes in the Summer Olympics. A similar trend can be seen in the Winter Olympics after 1984.

2. Heatmap of Medals by Sports and Years¶

In [1355]:
# Group by Decade and Sport
heatmap_data_decade = (
    df_new[df_new['Medal'] != 'No Medal']
    .groupby([(df_new['Year'] // 10) * 10, 'Sport'])['ID']
    .count()
    .unstack(fill_value=0)
)

# Create the heatmap
plt.figure(figsize=(23, 16))
sns.heatmap(heatmap_data_decade, cmap='YlGnBu', linewidths=0.5, linecolor='gray')
plt.title("Heatmap of Medals by Sports and Decades", fontsize=18)
plt.xlabel("Sport", fontsize=18)
plt.ylabel("Decade", fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.show()
[Figure: Heatmap of Medals by Sports and Decades (Sport vs. Decade)]

Interpretation and Observation:
From the heatmap, a few things can be interpreted:

  • Certain sports like Athletics, Rowing, Football, Ice Hockey, and Swimming have shown increasing popularity, consistently awarding more medals in recent decades compared to the past.
  • Sports such as Gymnastics, Fencing, and Shooting have experienced fluctuating popularity, with noticeable peaks and declines over different decades. This indicates shifts in athlete participation, audience interest, or perhaps changes in event availability over time.
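
The decade buckets behind the heatmap come from integer division: `(Year // 10) * 10` rounds each year down to the start of its decade. A minimal sketch on a few made-up medal rows (the `toy` frame and the `Decade` column name are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data: one row per medal
toy = pd.DataFrame({
    "Year":  [1988, 1992, 1994, 2016],
    "Sport": ["Judo", "Judo", "Speed Skating", "Judo"],
})

# 1988 -> 1980, 1992 -> 1990, 2016 -> 2010
toy["Decade"] = (toy["Year"] // 10) * 10

# Rows are decades, columns are sports, cells are medal counts
table = toy.groupby(["Decade", "Sport"]).size().unstack(fill_value=0)
print(table)
```

A table of this shape is what `sns.heatmap` receives in the cell above.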

3. Athlete Count vs. Medals Won¶

In [1358]:
# Count unique athletes and medals by country
country_stats = df_new.groupby('NOC').agg({
    'Name': 'nunique',  # Unique count of athletes
    'Medal': lambda x: (x != 'No Medal').sum()  # Count of medals (excluding 'No Medal')
}).reset_index()

# Rename columns for clarity
country_stats.columns = ['Country', 'Athlete Count', 'Medal Count']
plt.figure(figsize=(10, 6))
plt.scatter(country_stats['Athlete Count'], country_stats['Medal Count'], alpha=0.7)
plt.title("Athlete Count vs. Medals Won by Country")
plt.xlabel("Number of Athletes")
plt.ylabel("Number of Medals")
plt.grid()
plt.show()
[Figure: Athlete Count vs. Medals Won by Country (scatter plot, Number of Athletes vs. Number of Medals)]

Interpretation and Observation:

  • The scatter plot reveals a positive correlation between the number of athletes sent by a country and the number of medals won. This relationship is intuitive, as countries that send larger delegations generally have a higher chance of securing more medals due to increased representation across different events.
  • There are a few outliers: countries that sent a relatively small number of athletes but still won a significant number of medals. Such cases are rare.
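
The positive relationship described above could be quantified with a Pearson correlation coefficient. The sketch below uses made-up numbers in a frame shaped like `country_stats`. The values are hypothetical and the result says nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data shaped like country_stats
toy_stats = pd.DataFrame({
    "Athlete Count": [10, 50, 100, 200],
    "Medal Count":   [1, 4, 12, 30],
})

# Series.corr defaults to the Pearson correlation coefficient
r = toy_stats["Athlete Count"].corr(toy_stats["Medal Count"])
print(f"Pearson r: {r:.3f}")
```

On the real data the same call would be `country_stats['Athlete Count'].corr(country_stats['Medal Count'])`. A value near 1 would support the visual impression from the scatter plot.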

4. Season-Specific Medal Trends¶

In [1361]:
# Filter data for rows with medals
medal_data = df_new[df_new['Medal'] != 'No Medal']

# Group by Season and count medals
season_medals = medal_data.groupby('Season')['ID'].count().reset_index()
season_medals.columns = ['Season', 'Medal Count']
print(season_medals)
   Season  Medal Count
0  Summer        34088
1  Winter         5695
In [1362]:
plt.figure(figsize=(8, 6))
plt.bar(season_medals['Season'], season_medals['Medal Count'], color=['#FFA07A', '#87CEEB'])
plt.title("Total Medals Won: Summer vs Winter Olympics")
plt.xlabel("Season")
plt.ylabel("Number of Medals")
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
[Figure: Total Medals Won: Summer vs Winter Olympics (bar chart)]

Interpretation/Observation:

  • The bar chart shows a significantly higher number of medals awarded in the Summer Olympics compared to the Winter Olympics. This is largely due to the fact that the Summer Olympics feature a greater number of sports and events, allowing for more athletes to participate and more medals to be awarded.
  • This difference in medal counts is a reflection of larger athlete participation in the Summer Olympics, as there are simply more events and opportunities to compete.

5. BMI Distribution by Sport¶

In [1365]:
# Box plot to see how BMI varies over different sports
plt.figure(figsize=(16, 8))
sns.boxplot(
    x='Sport', 
    y=df_new['Weight'] / (df_new['Height'] / 100) ** 2,  # Calculate BMI directly
    data=df_new, 
    showfliers=False
)
plt.title("BMI Distribution by Sport")
plt.xlabel("Sport")
plt.ylabel("BMI")
plt.xticks(rotation=90)
plt.show()
[Figure: BMI Distribution by Sport (box plot, one box per sport)]

Interpretation/Observation:

  • From the data, it’s evident that different sports exhibit distinct BMI characteristics, which is likely due to the varying physical requirements for each sport.
  • Sports such as Weightlifting, Wrestling, and Judo tend to have a wider range of BMI values. This is because these sports have different weight categories, which means athletes can range significantly in size and muscle mass.
  • Sports like Tug-Of-War and Rugby Sevens generally have higher average BMI values. This is because these sports require a lot of strength, muscle mass, and power, which naturally correlates with a higher BMI.
  • In contrast, sports like Rhythmic Gymnastics and Synchronized Swimming tend to have lower average BMI values. These sports demand agility, flexibility, and endurance, which often means athletes maintain a leaner physique.
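
For reference, the BMI used in the box plot is weight in kilograms divided by the square of height in metres. Heights in the data are in centimetres, hence the division by 100. A worked sketch using the first two rows of the table shown earlier (180 cm / 80 kg and 170 cm / 60 kg). The `toy` frame and `BMI` column name are introduced here for illustration.

```python
import pandas as pd

# Height (cm) and Weight (kg) of the first two athletes in the table above
toy = pd.DataFrame({"Height": [180.0, 170.0], "Weight": [80.0, 60.0]})

# BMI = weight [kg] / (height [m])^2
toy["BMI"] = toy["Weight"] / (toy["Height"] / 100) ** 2
print(toy.round(2))
```

This gives about 24.69 for the first athlete and about 20.76 for the second. Storing BMI as a column is one option that would make it reusable in later plots.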

Based on Demographics¶

Understanding participations based on age¶

In [1369]:
# Histogram to understand distribution of athletes of different age
fig = px.histogram(df_new, x ='Age', nbins=60, title='age distribution')
fig.show()

Interpretation/Observation

Analyzing women's participation over the years¶

In [1373]:
# Group the data by 'Year' and 'Sex' and count the number of participants
grouped_data = df_new.groupby(['Year', 'Sex']).size().reset_index(name='Count')

# Bar graph to understand distribution of both sexes over the years
fig = px.bar(grouped_data, x ='Year', y = 'Count', 
             color='Sex',
             barmode='group', 
             title= 'Number of Men and Women Participating Each Year',
            labels={'Count': 'Number of Participants', 'Year': 'Year'})
fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Number of Participants',
    title_x=0.5,
    xaxis_tickangle=45  # Rotate x-axis labels for better readability
)

fig.show()

Understanding Height and Weight based on Sex¶

In [1375]:
# Box plot to show Height based on Sex
fig = px.box(df_new, x='Sex', y='Height', color ='Sex', title='Sex Vs Height')
fig.show()
In [1376]:
# Box plot to show Weight based on Sex
fig = px.box(df_new, x='Sex', y='Weight', color ='Sex', title='Sex Vs Weight')
fig.show()

Based on Teams and Medal wins¶

In [1392]:
# Clean up data to visualize the medal wins
# Keep only rows where a medal was won
df_filtered = df_new[df_new['Medal'] != 'No Medal']

# Group by country and medal type
df_medal = df_filtered.groupby(['Country', 'Medal']).size().reset_index(name='Count')

# Pivot to create separate columns for each medal type and fill none values with zeros
df_medals_pivot = df_medal.pivot(index='Country', columns='Medal', values='Count').fillna(0)

# Add a Total column
df_medals_pivot['Total'] = df_medals_pivot.sum(axis=1)

# Reset the index to turn it back into a DataFrame
df_medals_pivot = df_medals_pivot.reset_index()
In [1394]:
# Sort by each medal type and extract top 10 teams
top_gold = df_medals_pivot.sort_values(by='Gold', ascending=False).head(10)
top_silver = df_medals_pivot.sort_values(by='Silver', ascending=False).head(10)
top_bronze = df_medals_pivot.sort_values(by='Bronze', ascending=False).head(10)

# Combine the top 10 lists and keep each country once, so that a country
# appearing in more than one list is not drawn twice
top_combined = pd.concat([top_gold, top_silver, top_bronze]).drop_duplicates(subset='Country')

# Create a grouped bar chart
fig = px.bar(
    top_combined, 
    x='Country', 
    y=['Gold', 'Silver', 'Bronze'], 
    title='Top 10 Teams by Medal Categories',
    labels={'value': 'Medal Count', 'variable': 'Medal Type'},
    barmode='group'
)

fig.show()

Expected Output¶

Height Distribution of Medalists by Gender¶

In [1398]:
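# Heights of male and female medalists (rows with a medal only)
medalists = df_new[df_new['Medal'] != 'No Medal']
male_heights = medalists[medalists['Sex'] == 'M']['Height']
female_heights = medalists[medalists['Sex'] == 'F']['Height']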

# Plot histogram
plt.figure(figsize=(8, 5))
plt.hist(
    [female_heights, male_heights],
    bins=20,
    edgecolor='black',
    alpha=1.0,  # No transparency to avoid overlapping colors
    color=['plum', 'lightskyblue'],
    label=['Female', 'Male'],
    stacked=True
)

# Add labels and titles
plt.title("Height Distribution of Medalists by Gender")
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.legend(title="Sex")
plt.grid(True)
plt.tight_layout()

plt.show()
[Figure: Height Distribution of Medalists by Gender (stacked histogram, Height (cm) vs. Frequency)]

Interpretation/Observation:

  • There are significant differences in the height distributions of male and female medalists.
  • Male athletes are, on average, taller than female athletes, with their distribution being centered around a higher mean. The majority of female athletes have a height between 155 cm and 175 cm, with a peak near 165 cm. This reflects the natural height differences between genders and highlights that the majority of female athletes are shorter compared to their male counterparts.
  • The differences in height distributions may be influenced by the types of sports that athletes compete in.
In [1406]:
# Group data by Year and Sex to count participants
gender_representation = df_new.groupby(['Year', 'Sex'])['ID'].count().unstack()

# Plot bar chart
gender_representation.plot(
    kind='bar',
    stacked=True,
    figsize=(12, 6),
    color=['lightblue', 'chocolate'],
    edgecolor='black'
)

# Add title and labels
plt.title("Gender Representation Over Time")
plt.xlabel("Year")
plt.ylabel("Number of Athletes")
plt.legend(title="Sex", labels=["Female", "Male"])
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()

# Show the plot
plt.show()
[Figure: Gender Representation Over Time (stacked bar chart, Year vs. Number of Athletes)]

Interpretation/Observation:

  • The number of female athletes rises over the years.
  • Overall athlete participation clearly increases, while the gap between the number of male and female participants appears to be narrowing.
  • There are significantly more participants in the Summer Olympics than in the Winter Olympics. A possible reason is that the Summer Olympics include more sports.

Insights and Generalizations¶

Additional Analysis/Findings (Exploring more)¶


Understand distribution of athletes of different age¶

In [1413]:
# Histogram to understand distribution of athletes of different age
fig = px.histogram(df_new, x ='Age', nbins=60, title='Age Distribution')
fig.show()

Observation

  • The histogram indicates that the age of participants ranges widely, from 10 to 97 years old (the minimum and maximum in the statistics table). This suggests a diverse set of participants in terms of age, depending on the type of sport and its physical demands.
  • Ages cluster around the mid-20s (median 25, mean about 25.6). This age range likely represents the peak performance years for many athletes, where they have the optimal combination of experience, physical strength, and fitness.

Understanding the participation of each sex over the years¶

In [1418]:
# Group the data by 'Year' and 'Sex' and count the number of participants
grouped_data = df_new.groupby(['Year', 'Sex']).size().reset_index(name='Count')

# Bar graph to understand distribution of both sexes over the years
fig = px.bar(grouped_data, x ='Year', y = 'Count', 
             color='Sex',
             barmode='group', 
             title= 'Number of Men and Women Participating Each Year',
            labels={'Count': 'Number of Participants', 'Year': 'Year'})
fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Number of Participants',
    title_x=0.5,
    xaxis_tickangle=45  # Rotate x-axis labels for better readability
)

fig.show()

Observation

  • The number of female participants has grown significantly over the years.
  • There are no female participants before 1900, indicating that women were largely excluded from the early Olympic Games. Female participation began to appear in the early 20th century, as more sports began allowing female competitors, marking a shift towards gender inclusivity in the Olympics.

Note: There are no participants for the years 1916, 1940 and 1944. The Olympics were cancelled in those years because of World War I and World War II.
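
To put a number on the trend, the female share of entries per year can be computed from the same kind of grouped counts. A minimal sketch on made-up rows (the `toy` frame is hypothetical, and the resulting shares are not real Olympic figures):

```python
import pandas as pd

# Hypothetical toy data: one row per athlete entry
toy = pd.DataFrame({
    "Year": [1896, 1896, 1896, 1900, 1900, 1900, 1900],
    "Sex":  ["M", "M", "M", "M", "M", "M", "F"],
})

# Rows are years, columns are sexes; years without women get 0 instead of NaN
counts = toy.groupby(["Year", "Sex"]).size().unstack(fill_value=0)

# Female share of all entries in each year, in percent
female_share = counts["F"] / counts.sum(axis=1) * 100
print(female_share)
```

Applied to `df_new`, a series like this could be plotted as a single line to show how the share changes over time.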

Analyze relation between Height, Weight and Sex¶

In [1423]:
# Box plot to show Height based on Sex
fig = px.box(df_new, x='Sex', y='Height', color ='Sex', title='Sex Vs Height')
fig.show()

Observation

  • The median height of male athletes is greater than that of female athletes. Also, a good number of male athletes are taller than most female athletes.
In [1425]:
# Box plot to show Weight based on Sex
fig = px.box(df_new, x='Sex', y='Weight', color ='Sex', title='Sex Vs Weight')
fig.show()

Observation

  • The median weight of male athletes is noticeably higher than that of female athletes. This is expected given the natural differences in body composition and the physical demands of many sports where male athletes compete.
  • These observations highlight the differences in physical attributes between male and female athletes, which are influenced by both biological factors and the types of sports they participate in.

Summary¶

Demographics and Participation Trends

  • Increasing Participation Over Time:
    Both male and female participation has grown significantly since the inception of the modern Olympics. The growth in female athletes is particularly notable, with increased representation especially after the 1960s, reflecting changing societal norms and a shift towards gender inclusivity.
  • Seasonal Trends: The Summer Olympics have consistently attracted a larger number of participants and awarded more medals compared to the Winter Olympics, mainly due to a broader range of sports and events.

Medal Distribution Insights

  • Age and Medal Success: The analysis suggests that the majority of medals are won by athletes in their mid-20s to early 30s.
  • Countries with Higher Participation: Countries that send larger delegations tend to win more medals, indicating a positive correlation between the number of participating athletes and medal success. However, outliers exist—certain countries are able to win a substantial number of medals despite sending fewer athletes, likely due to targeted training programs or specialization in particular sports.

Physical Attributes

  • Height Differences: Male athletes are generally taller, reflecting biological factors and sport-specific physical demands.
  • Weight Differences: Male athletes also have higher median weights, especially in sports like weightlifting and wrestling, which require strength and power.
  • BMI Trends by Sport: High BMI is common in power-based sports (e.g., weightlifting, rugby), while lower BMI is typical in agility-focused sports (e.g., gymnastics, long-distance running).

Sports-Specific Trends

  • Growth in Popularity: Sports like Athletics, Swimming, and Football have seen increased participation and a growing medal count over time.

In conclusion, the Olympic Games have evolved significantly since their inception, moving from a predominantly male-dominated arena to one that encourages gender inclusivity and diversity. The growth in both number of sports and athlete participation highlights the expanding appeal and scope of the games.